-
Notifications
You must be signed in to change notification settings - Fork 654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NIOAsyncWriter: Fix suspending yield when writer is finished #3044
NIOAsyncWriter: Fix suspending yield when writer is finished #3044
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great catch, thanks so much!
Regarding your test, it might be useful for the test to use a new delegate that makes it easier to arrange the ordering to work out. NIO's ConditionLock can be used to set up a nice ordering chain, so that instead of using the usleeps we can just block on that lock being unlocked with the right value.
bf18517
to
f15a87b
Compare
@Lukasa : I updated it with ConditionLock. Is this what you had in mind? This is still blocking a thread from the concurrency pool and then depending on another thread to unblock it. Will that not be a problem? The only way I see around that is to add an async hook to NIOAsyncWriter. |
For testing purposes, that blocking will be safe. We will eventually resolve it, and we don't rely on Dispatch to make that happen (it requires the NIO thread to do it), so it'll be ok. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really lovely patch, thanks so much @orobio!
I'm not sure I fully understand what you're saying here, but I assume it's fine then. I do still have that 'FIXME' line in there, which would probably be good to remove. Well..., this got merged while I was typing, so I'll open a new pull request for that. Edit: Opened #3059 |
This was left in when pull request apple#3044 was merged, but as discussed on that pull request, this doesn't need to be fixed.
This was left in when pull request #3044 was merged, but as discussed on that pull request, this doesn't need to be fixed.
It appears this pull gets the NIO tests to hang in the Android x86_64 emulator, but weirdly only when compiled and run on macOS, ie the linux CI runs shown there work fine. I don't know if that's a macOS cross-compilation issue or some incompatibility with the macOS emulator triggered by this pull, possibly both. Reverting this pull gets my Android CI runs on macOS to work again. Let me know what you guys think. |
That’s very interesting. Do you have a smaller repro? The OS-specificity makes me think this is probably a Swift bug, but if it’s easy enough to reproduce we may be able to figure out if the test that hangs is just flaky. |
A bit tough for me to reproduce since I don't use macOS. @marcprux, any interest in looking into this on your Mac? |
When I test against a local Android emulator (with
However, the test run does ultimately fail due to a leaked promise. Edit: see next comment.
Here's the leaked promise, although it is likely a red herring:
|
Actually, scratch all that. It seems to be hanging randomly, on both macOS and Ubuntu runners, irrespective of API level. See the runs at https://github.com/marcprux/swift-nio/actions/runs/12812276673 I'm still not able to get it to hang on my local machine, but I'll keep trying. |
Does it always hang on the same test? |
No, it seems to vary. Sometimes in
|
Yeah, that seems fairly likely. Can you run the underlying test binary directly? Potentially running it under |
On my Android CI, it is entirely consistent: linux CI runs pass every time while macOS runs always hang. See the run yesterday: across two attempts with this pull, I had four macOS hangs and six linux runs pass (ignore the devel macOS failures, that's a separate issue). So much so that today I only reverted this pull for macOS, finagolfin/swift-android-sdk@7476f2be5, and all the runs passed. |
@finagolfin What happens if you only revert the test? |
I don't think this test is the issue, as it appears to be failing elsewhere in the log, but I will check that and let you know. |
Nope, you were right to suspect the test, as simply reverting this test on macOS got the NIO tests to pass again. 😄 Marc, can you confirm? |
Confirmed: if I remove the test, then the hangs never occurs in the emulator (running on macOS-13 or Ubuntu). |
@marcprux, are you disabling all NIOAsyncWriterTests though, as shown in the linked commit above, marcprux/swift-nio@87d02310d? If so, it would be better to just disable this new test, as I did above, to see if this code change affects the older tests too, which you are no longer running. I'm going to try running the linux-compiled tests on the macOS emulator and vice versa next, to see if that helps narrow down where this strange hang is coming from, as it could be that we're hitting a bug in the Android emulator for macOS. |
This could be an emulator bug, but it could also be that we're seeing an ordering issue flushed out by the performance overhead of the emulator. It'd be useful to try to get a "sequence of events" log from your hanging test as well, as concurrent tests are horrible to debug. |
Alright, I ran the NIO tests alone in the Android emulator, once swapping the mac-compiled tests into the linux-hosted emulator and vice versa, once keeping the mac-built tests in the mac-hosted emulator and so on, where only the mac version hung. Looking at the results from the swapped run, the macos-hosted emulator again hung each time, with the linux-compiled tests not finishing for 20 minutes before I killed it (I have a 5 macOS job limit on the OSS free tier of github Actions, so I killed it once it seemed to hang, so other jobs could proceed). I just kicked off a swapped run again that I will let run for longer to make sure it does hang after longer (just killed it again after the macos-hosted emulators all hung for more than an hour and a half). This means the problem is in the interaction between this test and the macOS-hosted Android emulator, since the linux-compiled test runs fine in the linux-hosted emulator but hangs in the mac-hosted emulator. The mac cross-compile and mac-hosted Android emulator are consistently slower on github Actions, because of greater contention for fewer macOS CI machines or whatever, so either that lack of speed screws up this test or there is a bug in the macOS-hosted Android emulator. Regarding "a 'sequence of events' log," you want me to spray this test with print statements to figure out what's running when? |
Yeah, that was what I was asking for. Presumably this is the result of a specific ordering sequence caused by the slowdown in the emulator. |
This fixes hitting the following preconditionFailure in NIOAsyncWriter: "This should have already been handled by
yield()
".It doesn't expect a yield to be suspended when the state is
.writerFinished
, but this can definitely happen. This seemed to be the correct solution to me, but please check it carefully, because I'm unfamiliar with this code.I'm not happy about the test for this. It's blocking the
didYield
call for a while, to make sure the correct state is reached. See also the 'FIXME' line. What would be a better way to do this?